Research connecting text and images has recently seen several breakthroughs, with models like CLIP, DALL-E 2, and Stable Diffusion. However, the connection between text and other visual modalities, such as lidar data, has received less attention, hindered by the lack of text-lidar datasets. In this work, we propose LidarCLIP, a mapping from automotive point clouds to a pre-existing CLIP embedding space. Using image-lidar pairs, we supervise a point cloud encoder with the image CLIP embeddings, effectively relating text and lidar data with the image domain as an intermediary. We show the effectiveness of LidarCLIP by demonstrating that lidar-based retrieval is generally on par with image-based retrieval, but with complementary strengths and weaknesses. By combining image and lidar features, we improve upon both single-modality methods and enable a targeted search for challenging detection scenarios under adverse sensor conditions. We also use LidarCLIP as a tool to investigate fundamental lidar capabilities through natural language. Finally, we leverage our compatibility with CLIP to explore a range of applications, such as point cloud captioning and lidar-to-image generation, without any additional training. We hope LidarCLIP can inspire future work to dive deeper into connections between text and point cloud understanding. Code and trained models are available at https://github.com/atonderski/lidarclip.
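The core idea of the abstract, supervising a point cloud encoder with frozen CLIP image embeddings on paired data, can be sketched as a simple distillation objective. The sketch below only shows the objective on pre-computed embeddings; the encoders, the batch construction, and the exact loss used by LidarCLIP are omitted, and the function name and cosine-distance form are illustrative assumptions, not the authors' confirmed implementation.

```python
import numpy as np

def distillation_loss(lidar_emb, image_emb):
    """Cosine-distance objective for training a lidar encoder to mimic a
    frozen CLIP image encoder on image-lidar pairs (illustrative sketch;
    the actual LidarCLIP training loss may differ in detail).

    lidar_emb, image_emb: (batch, dim) arrays of embeddings, where each
    row of lidar_emb is paired with the same row of image_emb.
    """
    # Normalize to unit length, as CLIP embeddings are compared by cosine.
    l = lidar_emb / np.linalg.norm(lidar_emb, axis=1, keepdims=True)
    i = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    # 1 - cosine similarity, averaged over the batch of pairs.
    return float(np.mean(1.0 - np.sum(l * i, axis=1)))
```

Because the lidar encoder is trained into the existing CLIP space, any text embedded with the frozen CLIP text encoder can then be compared to point clouds by cosine similarity, which is what enables the retrieval and zero-shot applications described above.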
With the increasing number and availability of unmanned aerial vehicles (UAVs) and other remote sensing devices (e.g., satellites), we have recently seen a sharp rise in computer vision methods for aerial-view data. One application of such technologies is search and rescue (SAR), where the task is to localize and assist one or several missing people, for example after a natural disaster. In many cases, a rough location may already be known, and a UAV can be deployed to explore a given, limited area to precisely pinpoint the missing people. Due to time and battery constraints, it is crucial that localization is performed as efficiently as possible. In this work, we tackle this type of problem by abstracting it as an aerial-view goal localization task within a framework that emulates a SAR-like setup without requiring access to an actual UAV. In this framework, an agent operates on top of an aerial image (a proxy for the search area) and is tasked with localizing a goal described in terms of visual cues. To further mimic the situation on an actual UAV, the agent cannot observe the search area in its entirety, not even at low resolution, and must therefore operate solely on partial glimpses while navigating toward the goal. To solve this task, we propose AirLoc, a reinforcement learning (RL)-based model that decouples exploration (searching for distant goals) from exploitation (localizing nearby goals). Extensive evaluations show that AirLoc outperforms heuristic search methods as well as alternative learnable approaches. We also conduct a proof-of-concept study showing that the learnable methods outperform humans on average. Code is publicly available at https://github.com/aleksispi/airloc.
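The partial-glimpse setup described in the abstract, an agent moving over a search area while only ever seeing a small crop around its position, can be made concrete with a toy environment. Everything below (class name, grid and glimpse sizes, the binary observation, the reward values) is an illustrative assumption for exposition, not the AirLoc framework itself.

```python
import numpy as np

class GlimpseEnv:
    """Toy stand-in for aerial-view goal localization: the agent moves over
    a 2D grid (the 'search area') but only observes a small patch centred
    on its current position, never the whole area. Illustrative only."""

    def __init__(self, size=16, glimpse=3, seed=0):
        rng = np.random.default_rng(seed)
        self.size = size
        self.glimpse = glimpse
        self.goal = rng.integers(0, size, size=2)
        self.pos = np.array([size // 2, size // 2])

    def observe(self):
        # Partial glimpse: a (glimpse x glimpse) binary patch that is 1
        # at the goal's location if it is visible, and 0 everywhere else.
        r = self.glimpse // 2
        patch = np.zeros((self.glimpse, self.glimpse))
        dy, dx = self.goal - self.pos
        if abs(dy) <= r and abs(dx) <= r:
            patch[dy + r, dx + r] = 1.0
        return patch

    def step(self, action):
        # Actions: 0=up, 1=down, 2=left, 3=right; movement is clipped to
        # the search area. Small step penalty encourages efficient search.
        moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
        self.pos = np.clip(self.pos + np.array(moves[action]), 0, self.size - 1)
        done = bool((self.pos == self.goal).all())
        return self.observe(), (1.0 if done else -0.01), done
```

An RL policy trained in such an environment must trade off exploration (moving to uncover unseen regions) against exploitation (homing in once the goal enters a glimpse), which is exactly the decoupling AirLoc is built around.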
Speaker localization using microphone arrays depends on accurate time delay estimation techniques. For decades, methods based on the generalized cross-correlation with phase transform (GCC-PHAT) have been widely used for this purpose. Recently, GCC-PHAT has also been used to provide input features to neural networks in order to remove the effects of noise and reverberation, but at the cost of the theoretical guarantees that hold in noise-free conditions. We propose a novel approach that extends GCC-PHAT by filtering the received signals with a shift-equivariant neural network, which preserves the timing information contained in the signals. Through extensive experiments, we show that our model consistently reduces the error of GCC-PHAT in adverse environments, while guaranteeing exact time delay recovery in ideal conditions.
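The classical GCC-PHAT baseline that the abstract builds on can be sketched directly: whiten the cross-power spectrum of the two microphone signals so that only phase (i.e., timing) information remains, then take the lag of the cross-correlation peak. This is a standard textbook formulation, not the paper's neural extension; the function name and the small epsilon guard are our own choices.

```python
import numpy as np

def gcc_phat(x, y, fs=1.0):
    """Estimate the time delay of signal x relative to signal y using
    generalized cross-correlation with phase transform (GCC-PHAT).
    Returns the delay in seconds (in samples when fs=1)."""
    n = len(x) + len(y)                      # zero-pad to avoid wrap-around
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = X * np.conj(Y)                       # cross-power spectrum
    R /= np.abs(R) + 1e-12                   # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)                # back to the lag domain
    max_shift = n // 2
    # Reorder so that index max_shift corresponds to zero lag.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs
```

The PHAT weighting discards magnitude information, which makes the peak sharp and delay recovery exact in noise-free conditions; the paper's contribution is a learned, shift-equivariant front-end filter that keeps this timing structure intact while suppressing noise and reverberation.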